“Financial Analysis of Field of Studies - Bachelor Degrees in US”

This is a complement to the basic statistical analysis that was done based on the data downloaded from the US Department of Education collegescorecard web site.

The US department of education , provides the option to download the data sets used in the tool from the following website:

https://collegescorecard.ed.gov/data/

The analsysis uses the data in the file FieldOfStudyData1516_1617_PP.csv; combining it with Institution information to display a financial comparison of programs. The Earning by Field of study data shows the Median Earnings (and Debt) of Students as of 1 year after graduation for the classes 2015_16 anf 2016_17.

The Data has been pre-processed to show only the records for which Earnings and Tuition information is available. The data has 19,563 records with 14 columns.

FieldofStudy_Earnings_for_rep = read.csv("C:/Users/figue/Documents/Collegeboard_data/Field_of_Studies_Earn_rep.csv", stringsAsFactors = FALSE)
dim(FieldofStudy_Earnings_for_rep)
## [1] 19563    14
colnames(FieldofStudy_Earnings_for_rep)
##  [1] "INSTNM"             "MAIN"               "CIPDESC"           
##  [4] "STABBR"             "DEBTMEDIAN"         "MD_EARN_WNE"       
##  [7] "ADM_RATE"           "SAT_AVG"            "Tuition_Average"   
## [10] "Tuition_more_110K"  "Debt_payback_YR"    "Tuit_Avg_4Y_Pyback"
## [13] "Tuit_Inc_4Y_Pyback" "Salary_10Y_4Y_Tui"

In this analysis, we will review how the words contained in a field of study affect the potential earnings. I will also use SAT scores and Admission rates as predictors of future earnings. This data will be used to create linear regression models that will be discussed later in this document.

The reader might be thinking whether the State where the Institution is located might have something to do with the financial outcome. Prior to moving forward , let’s create a simple linear regression model to test this hypothesis.

lm.Ear.cost = lm(Salary_10Y_4Y_Tui ~ STABBR + ADM_RATE +SAT_AVG , data=FieldofStudy_Earnings_for_rep)

summary(lm.Ear.cost)
## 
## Call:
## lm(formula = Salary_10Y_4Y_Tui ~ STABBR + ADM_RATE + SAT_AVG, 
##     data = FieldofStudy_Earnings_for_rep)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -360598  -94744  -34884   71607  911919 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  193623.07   64986.16   2.979 0.002892 ** 
## STABBRAL     -69946.69   63268.11  -1.106 0.268934    
## STABBRAR     -71263.85   63754.06  -1.118 0.263672    
## STABBRAZ      -6095.70   63843.42  -0.095 0.923936    
## STABBRCA     -58024.91   62789.80  -0.924 0.355441    
## STABBRCO     -63852.43   63188.43  -1.011 0.312267    
## STABBRCT     -56231.28   63624.63  -0.884 0.376819    
## STABBRDC     -74339.47   64634.13  -1.150 0.250096    
## STABBRDE     -15518.07   66360.61  -0.234 0.815109    
## STABBRFL     -75210.13   62928.31  -1.195 0.232038    
## STABBRGA     -63645.53   63001.20  -1.010 0.312402    
## STABBRHI     -54241.71   64903.56  -0.836 0.403321    
## STABBRIA     -42063.28   63229.22  -0.665 0.505900    
## STABBRID     -35680.55   64027.34  -0.557 0.577351    
## STABBRIL     -50045.53   62899.62  -0.796 0.426253    
## STABBRIN     -28347.04   62998.87  -0.450 0.652745    
## STABBRKS     -26540.53   64013.00  -0.415 0.678432    
## STABBRKY     -59373.37   63256.57  -0.939 0.347944    
## STABBRLA     -66690.71   63346.89  -1.053 0.292455    
## STABBRMA     -53643.40   62992.72  -0.852 0.394460    
## STABBRMD     -54281.10   63395.55  -0.856 0.391884    
## STABBRME     -11604.60   65429.24  -0.177 0.859227    
## STABBRMI     -29412.68   62910.44  -0.468 0.640125    
## STABBRMN     -47874.19   63032.33  -0.760 0.447554    
## STABBRMO     -57315.98   63104.08  -0.908 0.363746    
## STABBRMS     -75282.94   64037.35  -1.176 0.239769    
## STABBRMT     -83519.67   64917.79  -1.287 0.198272    
## STABBRNC     -75089.31   62950.81  -1.193 0.232956    
## STABBRND      23123.61   64910.38   0.356 0.721666    
## STABBRNE     -38356.62   63992.91  -0.599 0.548922    
## STABBRNH     -61505.82   64482.65  -0.954 0.340182    
## STABBRNJ     -50655.22   63005.88  -0.804 0.421423    
## STABBRNM     -51641.68   64654.37  -0.799 0.424456    
## STABBRNV     -28959.93   64816.10  -0.447 0.655025    
## STABBRNY     -74742.45   62802.73  -1.190 0.234019    
## STABBROH     -60960.34   62872.78  -0.970 0.332270    
## STABBROK     -29787.34   63714.38  -0.468 0.640139    
## STABBROR     -76818.42   63392.61  -1.212 0.225612    
## STABBRPA     -50970.14   62774.11  -0.812 0.416826    
## STABBRPR    -251585.61   84853.29  -2.965 0.003032 ** 
## STABBRRI     -56247.71   64140.00  -0.877 0.380526    
## STABBRSC     -95521.38   63224.69  -1.511 0.130853    
## STABBRSD     -30873.04   64626.91  -0.478 0.632862    
## STABBRTN     -72496.08   63129.85  -1.148 0.250836    
## STABBRTX     -34984.68   62795.81  -0.557 0.577455    
## STABBRUT     -49569.47   63852.96  -0.776 0.437580    
## STABBRVA     -63205.35   63029.24  -1.003 0.315976    
## STABBRVI    -129905.10  153446.52  -0.847 0.397241    
## STABBRVT    -108198.31   65250.59  -1.658 0.097298 .  
## STABBRWA     -32416.56   63081.50  -0.514 0.607340    
## STABBRWI     -38163.20   62996.13  -0.606 0.544655    
## STABBRWV     -32993.22   63870.87  -0.517 0.605470    
## STABBRWY     -42190.39   67872.73  -0.622 0.534207    
## ADM_RATE     -30881.73    8221.75  -3.756 0.000173 ***
## SAT_AVG         162.99      11.74  13.882  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 140100 on 15685 degrees of freedom
##   (3823 observations deleted due to missingness)
## Multiple R-squared:  0.03911,    Adjusted R-squared:  0.0358 
## F-statistic: 11.82 on 54 and 15685 DF,  p-value: < 2.2e-16

First of all you can tell that the Linear regression model is not very good. Not too much is explained by the variables chosen, and the only State/Terrirory that is statistical significant is Puerto Rico (STABBRPR). By themselves, the SAT average and the Admission rate are statistical significant but they do not predict a lot (R-squared: 0.03911). We are missing the Field of Study information, this is what matters the most.

Since Puerto Rico is relevant , I will add a variable to keep track of this Territory and I will avoid using the feature STABBR which contains the values for all states.

FieldofStudy_Earnings_for_rep$Puerto_Rico = ifelse(FieldofStudy_Earnings_for_rep$STABBR=="PR",1,0)

Let’s continue by splitting the description of the fields of study into individual words. The tm package in r is very helpful to complete this. There are many different words in the descriptions of the fields of study, it is important to keep the ones that repeat the most , those are the most relevant.

library(tm)
## Loading required package: NLP
#CIPDESC is the description . I would like to analyze

docs = FieldofStudy_Earnings_for_rep$CIPDESC

#Vector Source can create a Corpuse from the vector

Field.Corpus = VCorpus(VectorSource(docs))

#Transformations to normalize data, remove common words and punctuation

#Conver to lower case

Field.Corpus = tm_map(Field.Corpus, content_transformer(tolower))

#there are still dots
#remove stop words

Field.Corpus = tm_map(Field.Corpus, removeWords, stopwords("english"))

#punctuation

Field.Corpus = tm_map(Field.Corpus, removePunctuation)

#Document term matrix

Field.matrix = DocumentTermMatrix(Field.Corpus)

# Convert to normal matrix in r
r.field.matrix = as.matrix(Field.matrix)

#Add the description names to 

rownames(r.field.matrix) = FieldofStudy_Earnings_for_rep$CIPDESC

#find most commong words
count_of_word = apply(r.field.matrix,2,sum)

count_of_word = count_of_word[order(count_of_word,decreasing=TRUE)]

#the most common is general 3,456 times

#there a total of 380 words. How many repeat more than 200 times in degreese

sum(count_of_word>200)
## [1] 72
# 72. Keep those 72


count_of_word  = count_of_word [count_of_word>200]

#now just keep those columns in the r.field.matrix

r.field.matrix.red = r.field.matrix[,names(count_of_word)]

r.field.matrix.red = cbind(r.field.matrix.red, FieldofStudy_Earnings_for_rep[,c("Puerto_Rico","ADM_RATE","SAT_AVG","Salary_10Y_4Y_Tui")])
#change state to Factor
r.field.matrix.red$Puerto_Rico = as.factor(r.field.matrix.red$Puerto_Rico)

colnames(r.field.matrix.red)
##  [1] "general"                 "nursing"                
##  [3] "management"              "services"               
##  [5] "administration"          "studies"                
##  [7] "engineering"             "sciences"               
##  [9] "arts"                    "business"               
## [11] "operations"              "related"                
## [13] "health"                  "education"              
## [15] "research"                "development"            
## [17] "psychology"              "clinical"               
## [19] "computer"                "communication"          
## [21] "professional"            "specific"               
## [23] "teacher"                 "biology"                
## [25] "registered"              "science"                
## [27] "corrections"             "criminal"               
## [29] "justice"                 "media"                  
## [31] "methods"                 "accounting"             
## [33] "levels"                  "information"            
## [35] "marketing"               "language"               
## [37] "literature"              "english"                
## [39] "applied"                 "physical"               
## [41] "social"                  "educationfitness"       
## [43] "humanities"              "liberal"                
## [45] "government"              "political"              
## [47] "sociology"               "human"                  
## [49] "fine"                    "studio"                 
## [51] "resources"               "economics"              
## [53] "finance"                 "financial"              
## [55] "professions"             "public"                 
## [57] "history"                 "mechanical"             
## [59] "communications"          "work"                   
## [61] "areas"                   "subject"                
## [63] "allied"                  "design"                 
## [65] "electrical"              "businesscommerce"       
## [67] "medical"                 "technologiestechnicians"
## [69] "relations"               "administrative"         
## [71] "systems"                 "electronics"            
## [73] "Puerto_Rico"             "ADM_RATE"               
## [75] "SAT_AVG"                 "Salary_10Y_4Y_Tui"

Several of these words are part of a single Field of Study description so they will be perfectly correlated. This is not benefitial for a linear regression (and in general for any machine learning algorithm). The following code removes the fields that are perfectly correlated.

# Using  Apply I can find the counts that are identical across words. By removing duplicates a eliminate perfect correlation
# and remove the columns with duplicated numbers r keep the ones that are unique
colnames(r.field.matrix.red[,c(1:72)])[!duplicated(apply(r.field.matrix.red[,c(1:72)],2,sum))]
##  [1] "general"                 "nursing"                
##  [3] "management"              "services"               
##  [5] "administration"          "studies"                
##  [7] "engineering"             "sciences"               
##  [9] "arts"                    "business"               
## [11] "operations"              "related"                
## [13] "health"                  "education"              
## [15] "research"                "development"            
## [17] "psychology"              "clinical"               
## [19] "computer"                "communication"          
## [21] "professional"            "specific"               
## [23] "biology"                 "registered"             
## [25] "science"                 "corrections"            
## [27] "media"                   "methods"                
## [29] "accounting"              "levels"                 
## [31] "information"             "marketing"              
## [33] "language"                "literature"             
## [35] "english"                 "applied"                
## [37] "physical"                "social"                 
## [39] "educationfitness"        "humanities"             
## [41] "government"              "sociology"              
## [43] "human"                   "fine"                   
## [45] "resources"               "economics"              
## [47] "finance"                 "professions"            
## [49] "public"                  "history"                
## [51] "mechanical"              "communications"         
## [53] "work"                    "areas"                  
## [55] "allied"                  "electrical"             
## [57] "businesscommerce"        "medical"                
## [59] "technologiestechnicians" "relations"              
## [61] "administrative"          "systems"                
## [63] "electronics"
r.field.matrix.red_nocor = r.field.matrix.red[,c(colnames(r.field.matrix.red[,c(1:72)])[!duplicated(apply(r.field.matrix.red[,c(1:72)],2,sum))],"Puerto_Rico","ADM_RATE","SAT_AVG","Salary_10Y_4Y_Tui")]

#Do a linear regression

lm.Ear.cost4 = lm(Salary_10Y_4Y_Tui ~ . , data=r.field.matrix.red_nocor)
summary(lm.Ear.cost4)
## 
## Call:
## lm(formula = Salary_10Y_4Y_Tui ~ ., data = r.field.matrix.red_nocor)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -347454  -43401   -2272   37860  823853 
## 
## Coefficients: (1 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.719e+05  9.533e+03  18.031  < 2e-16 ***
## general                 -3.074e+03  3.483e+03  -0.883 0.377407    
## nursing                  2.180e+05  2.621e+04   8.318  < 2e-16 ***
## management               3.085e+04  5.605e+03   5.503 3.79e-08 ***
## services                -2.607e+04  5.539e+03  -4.707 2.54e-06 ***
## administration           4.883e+04  7.816e+03   6.247 4.29e-10 ***
## studies                 -3.257e+04  3.696e+03  -8.812  < 2e-16 ***
## engineering              2.359e+05  3.315e+03  71.169  < 2e-16 ***
## sciences                -1.629e+04  4.075e+03  -3.999 6.40e-05 ***
## arts                    -1.301e+05  4.619e+03 -28.174  < 2e-16 ***
## business                 2.871e+04  6.667e+03   4.306 1.67e-05 ***
## operations              -4.024e+04  7.587e+03  -5.304 1.15e-07 ***
## related                 -8.184e+03  6.180e+03  -1.324 0.185408    
## health                  -1.352e+04  6.043e+03  -2.237 0.025284 *  
## education                6.934e+03  6.161e+03   1.125 0.260443    
## research                -2.427e+04  9.882e+03  -2.456 0.014056 *  
## development             -6.105e+04  1.374e+04  -4.444 8.89e-06 ***
## psychology              -6.558e+04  4.605e+03 -14.242  < 2e-16 ***
## clinical                -2.138e+04  9.317e+03  -2.295 0.021746 *  
## computer                 1.343e+05  5.191e+03  25.876  < 2e-16 ***
## communication           -4.372e+04  5.555e+03  -7.870 3.77e-15 ***
## professional             4.690e+04  5.656e+04   0.829 0.407036    
## specific                 1.107e+04  5.959e+04   0.186 0.852609    
## biology                 -6.847e+04  4.217e+03 -16.236  < 2e-16 ***
## registered              -5.906e+05  1.060e+05  -5.573 2.54e-08 ***
## science                  1.357e+05  6.516e+03  20.818  < 2e-16 ***
## corrections             -1.109e+04  4.369e+03  -2.538 0.011152 *  
## media                    3.781e+04  6.996e+03   5.404 6.62e-08 ***
## methods                  1.432e+05  1.195e+04  11.987  < 2e-16 ***
## accounting               1.663e+05  8.342e+03  19.939  < 2e-16 ***
## levels                  -1.407e+05  1.342e+04 -10.490  < 2e-16 ***
## information              1.370e+05  6.083e+03  22.524  < 2e-16 ***
## marketing                5.630e+04  4.377e+03  12.862  < 2e-16 ***
## language                -2.571e+04  1.300e+04  -1.978 0.047982 *  
## literature              -5.854e+04  2.281e+04  -2.566 0.010291 *  
## english                  7.705e+03  2.635e+04   0.292 0.769990    
## applied                  9.586e+04  6.018e+03  15.929  < 2e-16 ***
## physical                -4.414e+04  1.592e+04  -2.773 0.005554 ** 
## social                  -3.519e+04  8.799e+03  -4.000 6.38e-05 ***
## educationfitness        -1.126e+03  1.787e+04  -0.063 0.949760    
## humanities               1.488e+05  8.721e+03  17.064  < 2e-16 ***
## government              -1.599e+05  7.644e+03 -20.918  < 2e-16 ***
## sociology               -4.857e+04  4.667e+03 -10.407  < 2e-16 ***
## human                    6.816e+04  1.043e+04   6.534 6.61e-11 ***
## fine                     2.776e+04  6.300e+03   4.406 1.06e-05 ***
## resources               -1.664e+04  9.669e+03  -1.721 0.085197 .  
## economics                9.341e+04  4.797e+03  19.473  < 2e-16 ***
## finance                  1.262e+05  8.786e+03  14.359  < 2e-16 ***
## professions              3.322e+04  1.006e+04   3.300 0.000968 ***
## public                  -1.257e+04  7.234e+03  -1.738 0.082265 .  
## history                 -5.759e+04  4.959e+03 -11.612  < 2e-16 ***
## mechanical               2.010e+04  5.451e+03   3.687 0.000228 ***
## communications          -6.810e+04  1.068e+04  -6.377 1.86e-10 ***
## work                     1.618e+04  1.023e+04   1.581 0.113796    
## areas                           NA         NA      NA       NA    
## allied                   1.562e+05  1.211e+04  12.902  < 2e-16 ***
## electrical              -4.625e+04  1.913e+04  -2.418 0.015609 *  
## businesscommerce         6.806e+04  7.203e+03   9.449  < 2e-16 ***
## medical                 -1.168e+05  2.527e+04  -4.622 3.84e-06 ***
## technologiestechnicians  5.536e+04  7.256e+03   7.630 2.48e-14 ***
## relations               -3.245e+04  7.702e+03  -4.213 2.53e-05 ***
## administrative           2.046e+05  2.704e+04   7.568 4.00e-14 ***
## systems                  4.794e+04  9.947e+03   4.820 1.45e-06 ***
## electronics              1.987e+05  2.304e+04   8.626  < 2e-16 ***
## Puerto_Rico1            -1.868e+05  3.218e+04  -5.806 6.50e-09 ***
## ADM_RATE                -3.084e+04  3.923e+03  -7.860 4.08e-15 ***
## SAT_AVG                  1.030e+02  6.481e+00  15.889  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78600 on 15674 degrees of freedom
##   (3823 observations deleted due to missingness)
## Multiple R-squared:  0.6976, Adjusted R-squared:  0.6964 
## F-statistic: 556.3 on 65 and 15674 DF,  p-value: < 2.2e-16
#areas seem to be highly correlated.Yields singularities Remove.

The word areas seems to be correlated to other variables. There are other words that are highly correlated to ohters that should be removed. In the next lines of code, I check for the correlated variables and remove a few of them manually. In addition to this, I will remove words that are not statistically significant. After this is done a new linear regression is created.

#check correlated variables
#this returns the pairs.
#I need to keep only the top
Corr.m.9 = which(cor(r.field.matrix.red[,1:72])>.9 , arr.ind=TRUE)
#remove diagonal

Corr.m.9=Corr.m.9[!(Corr.m.9[,"row"] == Corr.m.9[,"col"]),]

#bind names that are too correlated

corr_pairs = cbind(rownames(cor(r.field.matrix.red[,1:72]))[Corr.m.9[,"row"]],colnames(cor(r.field.matrix.red[,1:72]))[Corr.m.9[,"col"]])

# Choose which word is more relevant or better describes a degree. The other words will have to be removed
#start with nursing
#for example this is highly correlated to clinical and registerd
#those should be removed
words_remove = corr_pairs[corr_pairs[,1]=="nursing",2]
#for keeping track remove from corr_pairs

corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]
#now look fo teacher. Development, professional, specific
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="teacher",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]

#justice
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="justice",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]

#methods
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="methods",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]

#literature
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="literature",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]

#educationfitness
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="educationfitness",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]

#humanities
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="humanities",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]

#government
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="government",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]

#fine
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="fine",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]

#finance
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="finance",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]

#subject
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="subject",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]

#electronics
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="electronics",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]

#medical
words_remove = c(words_remove,corr_pairs[corr_pairs[,1]=="medical",2])
#update pairs
corr_pairs=corr_pairs[!(corr_pairs[,1] %in% words_remove),]

#remove all fields that correspond to words that are highly correlated

r.field.matrix.red_nocor = r.field.matrix.red_nocor[,!(colnames(r.field.matrix.red_nocor) %in% words_remove)]


lm.Ear.cost4 = lm(Salary_10Y_4Y_Tui ~ . , data=r.field.matrix.red_nocor)

#I will remove coefficients that are not significant more than .05

lm.Ear.cost4$coefficients[2]
##   general 
## -942.0453
# retrieve the names of the variables that are significan
names.to.keep = names(summary(lm.Ear.cost4)[["coefficients"]][,"Pr(>|t|)"])[summary(lm.Ear.cost4)[["coefficients"]][,"Pr(>|t|)"]<.05] 

#except for intercept these are names in r.field.matrix.red_nocor. Remember that we also have to keep Salary_10Y_4Y_Tui
#For some reason Puerto_Rico1 is presented as a coefficient and not Puerto Rico
r.field.matrix.red_nocor.sig = r.field.matrix.red_nocor[,c(colnames(r.field.matrix.red_nocor)[colnames(r.field.matrix.red_nocor) %in% names.to.keep], "Puerto_Rico","Salary_10Y_4Y_Tui")]

lm.Ear.cost5 = lm(Salary_10Y_4Y_Tui ~ . , data=r.field.matrix.red_nocor.sig)

summary(lm.Ear.cost5)
## 
## Call:
## lm(formula = Salary_10Y_4Y_Tui ~ ., data = r.field.matrix.red_nocor.sig)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -375654  -43579   -2485   38370  841276 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              164233.13    9369.11  17.529  < 2e-16 ***
## nursing                   71078.79    2890.11  24.594  < 2e-16 ***
## management                62841.63    4950.69  12.694  < 2e-16 ***
## services                 -26759.45    5047.87  -5.301 1.17e-07 ***
## administration            25262.31    7419.83   3.405 0.000664 ***
## studies                  -30808.34    3419.50  -9.010  < 2e-16 ***
## engineering              237182.94    3179.71  74.593  < 2e-16 ***
## sciences                  -7043.00    3558.43  -1.979 0.047806 *  
## arts                    -127543.10    4466.01 -28.559  < 2e-16 ***
## business                  17548.74    6565.85   2.673 0.007531 ** 
## operations               -34121.99    7516.53  -4.540 5.68e-06 ***
## related                  -26407.10    5353.69  -4.933 8.20e-07 ***
## health                   -17694.76    5197.49  -3.404 0.000665 ***
## research                 -21301.51    9082.41  -2.345 0.019021 *  
## psychology               -66147.14    3235.20 -20.446  < 2e-16 ***
## computer                 136475.00    5181.89  26.337  < 2e-16 ***
## communication            -41129.12    5416.27  -7.594 3.29e-14 ***
## biology                  -68032.71    3259.82 -20.870  < 2e-16 ***
## science                  134073.61    6483.52  20.679  < 2e-16 ***
## media                     35245.03    6963.05   5.062 4.20e-07 ***
## methods                   20757.16    3733.20   5.560 2.74e-08 ***
## accounting               188256.94    7116.89  26.452  < 2e-16 ***
## information              128102.31    5867.10  21.834  < 2e-16 ***
## marketing                 59237.12    4300.26  13.775  < 2e-16 ***
## literature               -76359.66    4177.27 -18.280  < 2e-16 ***
## applied                   92570.79    5738.96  16.130  < 2e-16 ***
## social                   -39234.95    8738.41  -4.490 7.17e-06 ***
## educationfitness         -37997.77    6683.12  -5.686 1.33e-08 ***
## humanities               135249.54    7995.36  16.916  < 2e-16 ***
## government              -155544.34    7686.16 -20.237  < 2e-16 ***
## sociology                -45597.19    4591.06  -9.932  < 2e-16 ***
## human                     40227.98    7547.80   5.330 9.97e-08 ***
## fine                      28126.18    6299.20   4.465 8.06e-06 ***
## resources                -18599.24    8655.90  -2.149 0.031671 *  
## economics                 96856.34    4744.96  20.412  < 2e-16 ***
## finance                   97771.59    8242.50  11.862  < 2e-16 ***
## professions               49822.10    9089.81   5.481 4.29e-08 ***
## history                  -54712.65    4896.60 -11.174  < 2e-16 ***
## mechanical                23971.44    5444.75   4.403 1.08e-05 ***
## communications           -64916.90   10705.33  -6.064 1.36e-09 ***
## work                      23360.97   10128.83   2.306 0.021102 *  
## allied                   138937.28   10455.71  13.288  < 2e-16 ***
## businesscommerce          68063.08    6358.83  10.704  < 2e-16 ***
## medical                   75681.77   10513.81   7.198 6.37e-13 ***
## technologiestechnicians   55010.57    6755.76   8.143 4.15e-16 ***
## relations                -37475.40    7157.85  -5.236 1.67e-07 ***
## systems                   30747.48    9742.77   3.156 0.001603 ** 
## electronics              150766.56   12509.14  12.053  < 2e-16 ***
## ADM_RATE                 -30678.64    3950.21  -7.766 8.58e-15 ***
## SAT_AVG                     106.90       6.45  16.575  < 2e-16 ***
## Puerto_Rico1            -189701.84   32405.36  -5.854 4.89e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 79220 on 15689 degrees of freedom
##   (3823 observations deleted due to missingness)
## Multiple R-squared:  0.6925, Adjusted R-squared:  0.6915 
## F-statistic: 706.7 on 50 and 15689 DF,  p-value: < 2.2e-16

In this last linear regression all of the features are statistically significant.These features explain close to 70% of the variance in “Net Operating Income” after 10 years (R - squared 0.6925). A few things that can be pointed out are that students graduating in Puerto Rico should expect to earn $189,701 less over 10 years after graduation (if they keep the same 1st year salary) compared to students graduating from Institutions in one of the 50 States in the US.

SAT scores and admission rates matter in terms of earnings , probably because they are factors associated to prestigious Institutions. An extra point in the SAT AVG Score of the Universty translates into US106 in additional earnings over 10 years. To put it in perspective, the student graduating from an institution with an Average SAT score of 1,500 versus an Institution with an Average SAT Score of 1,100 will earn on average about US40,000 more over 10 years . The Admission rate have a similar impact but in the opposite direction, for every drop of one percentage point (0.01) of the admission rate for the institution the expectation would be to earn about US$306 more over 10 years.

Next, I show the density functions for SAT Scores and Admission Rates.

#For the calculation of the density function, I will start with the original data.
#Since the SAT score and admission rate is by Institution then this information repeats for the Fields of Studies offered at the 
#College/University. It is required to remove duplicates

#create Dataframe with required fields
SAT_ADM_INST = FieldofStudy_Earnings_for_rep[,c("INSTNM","ADM_RATE","SAT_AVG")]
#Eliminate duplicates
SAT_ADM_INST= SAT_ADM_INST[!(duplicated(SAT_ADM_INST)),]

# Remove NAs
SAT_ADM_INST=na.omit(SAT_ADM_INST)

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
#Create density chart for SAT first

dens <- density(SAT_ADM_INST$SAT_AVG)
df <- data.frame(x=dens$x, y=dens$y)

#want to find what is the percentile corresponding to an average score of 1500
percentile=ecdf(SAT_ADM_INST$SAT_AVG)
prob_1500 = round(percentile(1500),4)
prob_1500
## [1] 0.9896
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
quantiles <- quantile(SAT_ADM_INST$SAT_AVG, prob=probs)
df$quant <- factor(findInterval(df$x,quantiles))

quantile_SAT = paste(round(quantiles,0))
label_quants = paste(names(quantiles),quantile_SAT,sep="\n")

ggplot(df, aes(x,y)) + geom_line() + geom_ribbon(aes(ymin=0, ymax=y, fill=quant)) + scale_x_continuous(breaks=quantiles, labels = label_quants ) + scale_fill_brewer(guide="none") + ggtitle("SAT Scores Density") + xlab("% Percentile \n SAT Score") + ylab("Density / Proportion of Institutions")

#Create density for admission rate


dens <- density(SAT_ADM_INST$ADM_RATE)
df <- data.frame(x=dens$x, y=dens$y)
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
quantiles <- quantile(SAT_ADM_INST$ADM_RATE, prob=probs)
df$quant <- factor(findInterval(df$x,quantiles))

quantile_ADM = paste(round(quantiles,2))
label_quants = paste(names(quantiles),quantile_ADM,sep="\n")

ggplot(df, aes(x,y)) + geom_line() + geom_ribbon(aes(ymin=0, ymax=y, fill=quant)) + scale_x_continuous(breaks=quantiles, labels = label_quants ) + scale_fill_brewer(guide="none") + ggtitle("Admission Rate Density") + xlab("% Percentile \n Admission Rate") + ylab("Density / Proportion of Institutions")

The density chart for the SAT Scores shows that an SAT Average score of 1123 is the median for all Institutions in the US. In short, with this score about 50% of the instituation in the US would accept the student. Only 5 % of the institutions in the US report an Average SAT score of 1,398 or more. I did a simple calculation for the equivalent percentile for an SAT Score of 1500, it is very close to 99%. In short,only 1% of the institutions in the US report an Average SAT score of 1500 or more (the maximum possible SAT score is 1600).

The density for the admission rate shows a long tail to the left of the chart. The 5th percentile is 28%, more surprising is that the institutions at the 25th percentile accept 54% of the applicants. There are several factors taken into consideration during the admission process but it looks like students have a good chance to get accepted into the institution of their choice probably excluding the ones with very high SAT score requirements.

The next code will create divergence charts by using the words that were taken into consideration as factors for the Linear regreassion that was calculated earlier.

library(scales)
#This contains all of the estimated coefficients
summary(lm.Ear.cost5)[["coefficients"]][,"Estimate"]
##             (Intercept)                 nursing              management 
##             164233.1332              71078.7898              62841.6339 
##                services          administration                 studies 
##             -26759.4495              25262.3119             -30808.3364 
##             engineering                sciences                    arts 
##             237182.9435              -7042.9977            -127543.1000 
##                business              operations                 related 
##              17548.7412             -34121.9930             -26407.1000 
##                  health                research              psychology 
##             -17694.7623             -21301.5110             -66147.1432 
##                computer           communication                 biology 
##             136474.9971             -41129.1186             -68032.7097 
##                 science                   media                 methods 
##             134073.6082              35245.0269              20757.1586 
##              accounting             information               marketing 
##             188256.9406             128102.3070              59237.1243 
##              literature                 applied                  social 
##             -76359.6616              92570.7886             -39234.9531 
##        educationfitness              humanities              government 
##             -37997.7705             135249.5384            -155544.3415 
##               sociology                   human                    fine 
##             -45597.1951              40227.9828              28126.1844 
##               resources               economics                 finance 
##             -18599.2418              96856.3412              97771.5857 
##             professions                 history              mechanical 
##              49822.0960             -54712.6508              23971.4394 
##          communications                    work                  allied 
##             -64916.8963              23360.9666             138937.2813 
##        businesscommerce                 medical technologiestechnicians 
##              68063.0828              75681.7737              55010.5707 
##               relations                 systems             electronics 
##             -37475.3956              30747.4824             150766.5606 
##                ADM_RATE                 SAT_AVG            Puerto_Rico1 
##             -30678.6437                106.9021            -189701.8429
#will keep only the featuers for words in Fiedls of Study

Word_Fields = summary(lm.Ear.cost5)[["coefficients"]][,"Estimate"][2:48]
# Negative or Positive contribution
Contribution = ifelse(Word_Fields>0, "Positive", "Negative")
#bind 
Word_Fields=as.data.frame(cbind(Word_Fields,Contribution), stringsAsFactors=FALSE)
Word_Fields$Word = rownames(Word_Fields)

#update column Names

colnames(Word_Fields)=c("US_Dollars","Contribution_Direction","Word")

Word_Fields$US_Dollars = as.numeric(Word_Fields$US_Dollars)
#round US dollars

#have to correct the sorting.....
Word_Fields$US_Dollars = round(Word_Fields$US_Dollars,0)
#change words and Contribution to factors
Word_Fields$Word = as.factor(Word_Fields$Word)
Word_Fields$Contribution_Direction = as.factor(Word_Fields$Contribution_Direction)
#sort by dollars

Word_Fields=Word_Fields[order(Word_Fields$US_Dollars,decreasing = FALSE),]
#this is necenssary to conver Word to factor and for it to be sorted correctly

Word_Fields$Word <- factor(Word_Fields$Word, levels = Word_Fields$Word)


Word_Fields$Contribution_Direction <- factor(Word_Fields$Contribution_Direction, levels(Word_Fields$Contribution_Direction)[c(2,1)])

p = ggplot(Word_Fields, aes(x=Word, y=US_Dollars, label=Word)) +
   geom_bar(stat='identity', aes(fill=Contribution_Direction), width=.5) +
   scale_fill_manual(name="Contribution Direction",
                    labels = c("Positive", "Negative"),
                    values = c("Positive"="#00ba38", "Negative"="red")) +  scale_y_continuous(labels=dollar_format()) +
   labs(subtitle="Postivie or Negative Contribution by Word in Bachelor Degree Field of Study'",
         title= "Diverging Bar Plot - US Dollar Amounts") +
   coord_flip()

p

The previous chart provides a quick way to identify the words in the Field of Studies that yield a better return over 10 years. As an example, combining engineering with computer or electronics predicts a better financial outcome by a large margin. On the negative side of the chart, there are a few surprises like Biology. It is likely that students in this type of career delay their higher incomes because they pursue Medicine or Veterinary programs after their undergraduate degrees. As for the word government that seems to be the one that affects the results more negatively, this is combined with political science in the name of the field of study so the net result is around 0.

Eventhough , this is probably a good guideline to make a decision; this should be combined with information about the range of incomes (variance) for a particular field of study. If the variance is large, then it becomes more important to research the best schools for that program to maximize the potential income after graduation.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
#I will focus on Finance and Financial Management Services. and Economics
Eco_Business = FieldofStudy_Earnings_for_rep[FieldofStudy_Earnings_for_rep$CIPDESC=="Finance and Financial Management Services." | FieldofStudy_Earnings_for_rep$CIPDESC=="Economics.",c("CIPDESC","ADM_RATE","SAT_AVG","Salary_10Y_4Y_Tui")]

# replace Finance and Finacial Management... with Financeand Economics. with Economics

Eco_Business$CIPDESC = ifelse(Eco_Business$CIPDESC=="Economics.","Economics","Finance")

#change to factor
Eco_Business$CIPDESC = as.factor(Eco_Business$CIPDESC)
#omit any NAs

Eco_Business=na.omit(Eco_Business)

fig <- plot_ly(Eco_Business, x = ~ADM_RATE, y = ~SAT_AVG, z = ~Salary_10Y_4Y_Tui, color = ~CIPDESC, colors = c('#BF382A', '#0C4B8E'))
fig <- fig %>% add_markers()
fig <- fig %>% layout(scene = list(xaxis = list(title = 'Admission Rate'),
                     yaxis = list(title = 'SAT Avg Score'),
                     zaxis = list(title = 'Net Income over 10Y')))

fig

The Scatter Plot shows how there is more variance for a degree in Economics (In red). A student by choosing the right Institution(s) is likely to maximize the level of income after graduation. As for the degree in Finance, eventhough there is an outlier, the expected income for the majority of the programs is clustered. In short there is more value in looking for good institutions offering Economics majors than Finance majors. For Finance Majors there are only 3 Instutions with top earnings after graduation as for Economics there are about ~20.

Conclusions

The following are takeaways from this and the previous statistical analysis.

  1. The most significant predictor of earnings after graduating from college is the Field of Study chosen and not the Institution from which the student graduates.

  2. Puerto Rico is the only territory / State, where the graduates should expect to earn less after graduation, statistically speaking. To address this situation the students in this territory should plan to move to one of the 50 states, after graduation, to increase their earning potential substantially.

  3. SAT Scores and Instituion Admission rates predict less than 4% of the variance of the earnings.

  4. Once a Field fo Study is chosen, it makes sense to look for Specific Institutions when there is a high degree of varaince in the expected Operating Income. For example studients pursuing careers in Computer Engineering, Computer Science, Economics and Nursing should look into the top institutions in terms of potential operating income.

  5. For degress that yield lowest levels of earnings after graduation, it is of the most importance for the students to attend institutions with the lowest possible costs. Student Loans should be avoided as much as possible when pursuing these degrees.